library(tidyverse)
library(DT)
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
library(plotly)
library(pander)
library(car)
HSS <- read_csv("../../Data/HighSchoolSeniors.csv")
#Remember: select "Session, Set Working Directory, To Source File Location", and then play this R-chunk into your console to read the HSS data into R.
#Clean data: drop every row that contains a missing value in any column
HSS <- HSS %>% drop_na()
#Label each respondent by household size
HSS$FamGroup <- case_when(HSS$Home_Occupants <= 3 ~ "Only Child", HSS$Home_Occupants == 4 ~ "Sibling Pair", HSS$Home_Occupants > 4 ~ "Large Family")
#find Q1, Q3, and interquartile range for the family-hours column
Q1 <- quantile(HSS$Doing_Things_With_Family_Hours, .25, na.rm = TRUE)
Q3 <- quantile(HSS$Doing_Things_With_Family_Hours, .75, na.rm = TRUE)
#stored as IQR.1 so the variable does not mask the IQR() function
IQR.1 <- IQR(HSS$Doing_Things_With_Family_Hours, na.rm = TRUE)
#find Q1, Q3, and interquartile range for the friends-hours column
Q1.2 <- quantile(HSS$Hanging_Out_With_Friends_Hours, .25, na.rm = TRUE)
Q3.2 <- quantile(HSS$Hanging_Out_With_Friends_Hours, .75, na.rm = TRUE)
IQR.2 <- IQR(HSS$Hanging_Out_With_Friends_Hours, na.rm = TRUE)
#only keep rows in dataframe that have values within 1.5*IQR of Q1 and Q3
HSS_C <- subset(HSS, HSS$Doing_Things_With_Family_Hours > (Q1 - 1.5*IQR.1) & HSS$Doing_Things_With_Family_Hours < (Q3 + 1.5*IQR.1))
#the friends filter uses the friends quartiles (Q1.2, Q3.2), not the family ones
HSS_C_ <- subset(HSS_C, HSS_C$Hanging_Out_With_Friends_Hours > (Q1.2 - 1.5*IQR.2) & HSS_C$Hanging_Out_With_Friends_Hours < (Q3.2 + 1.5*IQR.2))
#view row and column count of new data frame
#dim(HSS_C_)
#Filter and get precleaned values
HSS_C_ONly <- filter(HSS_C_, Home_Occupants <= 3)
HSS_C_Sib <- filter(HSS_C_, Home_Occupants > 3 & Home_Occupants <= 4)
HSS_C_Sibs <- filter(HSS_C_, Home_Occupants > 4)
#mean(HSS_C_Sib$Sleep_Hours_Schoolnight, na.rm = TRUE)
#mean(HSS_C_Sibs$Sleep_Hours_Schoolnight, na.rm = TRUE)
#mean(HSS_C_ONly$Sleep_Hours_Schoolnight, na.rm = TRUE)
#mean(HSS_C_ONly$Sleep_Hours_Non_Schoolnight, na.rm = TRUE)
#mean(HSS_C_Sib$Sleep_Hours_Non_Schoolnight, na.rm = TRUE)
#mean(HSS_C_Sibs$Sleep_Hours_Non_Schoolnight, na.rm = TRUE)
#Pre cleaning values
#Null More siblings <- More time with family
#mean(HSS_C_$Doing_Things_With_Family_Hours, na.rm = TRUE)
#17.25506
#mean(HSS_C_Sib$Doing_Things_With_Family_Hours, na.rm = TRUE)
#20.25258
#mean(HSS_C_ONly$Doing_Things_With_Family_Hours, na.rm = TRUE)
#13.28037
#mean(HSS_C_Sibs$Doing_Things_With_Family_Hours, na.rm = TRUE)
#27.96452
#Null Less siblings <- More time with friends
#mean(HSS_C_$Hanging_Out_With_Friends_Hours, na.rm = TRUE)
#19.22778
#mean(HSS_C_Sib$Hanging_Out_With_Friends_Hours, na.rm = TRUE)
#23.17568
#mean(HSS_C_ONly$Hanging_Out_With_Friends_Hours, na.rm = TRUE)
#11.90094
#mean(HSS_C_Sibs$Hanging_Out_With_Friends_Hours, na.rm = TRUE)
#32.10127
Before conducting any visualization or analysis, the data was cleaned by removing all significant outliers in both columns being tested, as well as all N/A responses within these columns. These actions reduced the original sample size from 500 observations down to 259 observations. The data was then split into three groups, explained below.
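As a rough illustration of the outlier rule described above, here is a minimal sketch that keeps only values strictly inside the 1.5 × IQR fences. The helper `within_fences()` and the `toy` data frame are hypothetical names made up for this example, and both fences are computed on the full toy data before filtering. It is not the exact code used for the HSS data.

```r
# Hypothetical helper: TRUE for values strictly inside the 1.5 * IQR fences.
within_fences <- function(x, k = 1.5) {
  q1 <- quantile(x, 0.25, na.rm = TRUE, names = FALSE)
  q3 <- quantile(x, 0.75, na.rm = TRUE, names = FALSE)
  spread <- q3 - q1
  # Missing values are treated as "not kept"
  !is.na(x) & x > (q1 - k * spread) & x < (q3 + k * spread)
}

# Toy data standing in for the two survey columns (not the real HSS data).
toy <- data.frame(
  family_hours  = c(2, 4, 5, 6, 7, 8, 60, NA),
  friends_hours = c(3, 5, 6, 8, 9, 10, 12, 7)
)

# A row is kept only if it passes the rule in both columns.
keep <- within_fences(toy$family_hours) & within_fences(toy$friends_hours)
toy_clean <- toy[keep, ]
nrow(toy_clean)
```

Writing the rule as one function applied to each column avoids accidentally mixing one column's quartiles with another column's IQR.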
I think that having more siblings would result in more time per week spent with family and less time with friends, since you already have more people in your house to hang out with. It might be interesting, though, to see whether, for that same reason, those with larger families try to leave the house more often so they can escape siblings and hang out with friends. Similarly, I feel like those with fewer siblings will spend more time with friends because they have fewer people around in their house to begin with.
For testing purposes the results will be broken into three groups: only children (82 observations), sibling pairs (94 observations), and large families (5+ members in the household, 83 observations). For each of the two variables (time with family and time with friends), three t-tests will be conducted, one for each pair of groups, meaning six tests in total.
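The pairwise design described above can be sketched with `combn()`, which lists every pair of groups so that the same two-sample t-test runs on each pair. The `toy` data frame here is simulated purely for illustration, and the column names `group` and `hours` are assumptions, not columns of the HSS data.

```r
set.seed(1)

# Simulated stand-in data; the labels mirror the three groups described above.
toy <- data.frame(
  group = rep(c("Only Child", "Sibling Pair", "Large Family"), each = 30),
  hours = rnorm(90, mean = 6, sd = 3)
)

# Every unordered pair of groups: 3 groups give 3 pairs.
group_pairs <- combn(unique(toy$group), 2, simplify = FALSE)

# Run a two-sided two-sample t-test on each pair and collect the results.
results <- do.call(rbind, lapply(group_pairs, function(p) {
  x <- toy$hours[toy$group == p[1]]
  y <- toy$hours[toy$group == p[2]]
  tt <- t.test(x, y, alternative = "two.sided", conf.level = 0.9)
  data.frame(
    group_1 = p[1],
    group_2 = p[2],
    t       = unname(tt$statistic),
    df      = unname(tt$parameter),
    p_value = tt$p.value
  )
}))
results
```

Running the same loop once per response variable would give the six tests in total.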
In each test the null and alternative hypotheses are as shown below:
\[ H_0^1: \mu_1 - \mu_2 = 0 \]
\[ H_a^1: \mu_1 - \mu_2 \neq 0 \]
\[ H_0^2: \mu_1 - \mu_3 = 0 \]
\[ H_a^2: \mu_1 - \mu_3 \neq 0 \]
\[ H_0^3: \mu_2 - \mu_3 = 0 \]
\[ H_a^3: \mu_2 - \mu_3 \neq 0 \]
μ1 represents the sibling pair mean time spent with family/friends.
μ2 represents the only children mean time spent with family/friends.
μ3 represents the large families mean time spent with family/friends.
Significance level will be α=0.10.
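For reference, the `t.test()` calls in this analysis leave `var.equal` at its default of `FALSE`, so they should be Welch two-sample tests; the non-integer degrees of freedom in the output tables are consistent with that. Assuming that is the case, the statistic for comparing groups \(i\) and \(j\), with sample means \(\bar{x}\), sample variances \(s^2\), and sizes \(n\) (notation introduced here for illustration), is:

\[ t = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{\dfrac{s_i^2}{n_i} + \dfrac{s_j^2}{n_j}}} \]

\[ \nu \approx \frac{\left( \dfrac{s_i^2}{n_i} + \dfrac{s_j^2}{n_j} \right)^2}{\dfrac{\left( s_i^2 / n_i \right)^2}{n_i - 1} + \dfrac{\left( s_j^2 / n_j \right)^2}{n_j - 1}} \]

Each null hypothesis is rejected only when the two-sided p-value from this statistic falls below α=0.10.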
#Plots
p2 <- ggplot(HSS_C_, aes(x=FamGroup, y=Doing_Things_With_Family_Hours,fill = FamGroup))+
geom_boxplot(outlier.shape = NA)+
labs(title="How many hours a week do students spend \n with family based on Family size?", x="Size of Family", y="Hours/Week with Family")+
stat_summary(fun=mean, geom="point", shape=4, size=4, color="red", fill="red") +
coord_cartesian(ylim = c(0, 25))+
theme(plot.title = element_text(hjust = 0.5))
ggplotly(p2)
We can see here (interactive plot) that the sibling pair group actually reported the most time spent with family, about 6.6 hours/week, and the large family group reported the least, about 5.5 hours/week.
pander(t.test(HSS_C_Sib$Doing_Things_With_Family_Hours, HSS_C_ONly$Doing_Things_With_Family_Hours, paired = FALSE, mu = 0, alternative = "two.sided", conf.level = 0.9))
| Test statistic | df | P value | Alternative hypothesis | mean of x | mean of y |
|---|---|---|---|---|---|
| 0.2739 | 171.4 | 0.7845 | two.sided | 6.585 | 6.354 |
pander(t.test(HSS_C_Sib$Doing_Things_With_Family_Hours, HSS_C_Sibs$Doing_Things_With_Family_Hours, paired = FALSE, mu = 0, alternative = "two.sided", conf.level = 0.9))
| Test statistic | df | P value | Alternative hypothesis | mean of x | mean of y |
|---|---|---|---|---|---|
| 1.414 | 175 | 0.1592 | two.sided | 6.585 | 5.452 |
pander(t.test(HSS_C_Sibs$Doing_Things_With_Family_Hours,
HSS_C_ONly$Doing_Things_With_Family_Hours, paired = FALSE, mu = 0, alternative = "two.sided", conf.level = 0.9))
| Test statistic | df | P value | Alternative hypothesis | mean of x | mean of y |
|---|---|---|---|---|---|
| -1.093 | 161 | 0.2759 | two.sided | 5.452 | 6.354 |
par(mfrow=c(1,3))
qqPlot(HSS_C_ONly$Doing_Things_With_Family_Hours)
## [1] 9 28
qqPlot(HSS_C_Sib$Doing_Things_With_Family_Hours)
## [1] 43 41
qqPlot(HSS_C_Sibs$Doing_Things_With_Family_Hours)
## [1] 83 46
p1 <- ggplot(HSS_C_, aes(x=FamGroup, y=Hanging_Out_With_Friends_Hours, tooltip = Hanging_Out_With_Friends_Hours, fill = FamGroup))+
geom_boxplot(outlier.shape = NA)+
labs(title="How many hours a week do students hangout \n with friends based on Family size?", x="Size of Family", y="Hours/Week with Friends")+
stat_summary(fun=mean, geom="point", shape=4, size=4, color="red", fill="red") +
coord_cartesian(ylim = c(0, 30))+
theme(plot.title = element_text(hjust = 0.5))
ggplotly(p1)
Here we can see that the large family group reported the most time spent with friends, about 9 hours/week, and the only child group reported the least, about 8.5 hours/week.
### T-Test Results for time spent with Friends vs Family size
#T-Tests for Friends
pander(t.test(HSS_C_Sib$Hanging_Out_With_Friends_Hours, HSS_C_ONly$Hanging_Out_With_Friends_Hours, paired = FALSE, mu = 0, alternative = "two.sided", conf.level = 0.9))
| Test statistic | df | P value | Alternative hypothesis | mean of x | mean of y |
|---|---|---|---|---|---|
| 0.003292 | 170.8 | 0.9974 | two.sided | 8.521 | 8.518 |
pander(t.test(HSS_C_Sib$Hanging_Out_With_Friends_Hours, HSS_C_Sibs$Hanging_Out_With_Friends_Hours, paired = FALSE, mu = 0, alternative = "two.sided", conf.level = 0.9))
| Test statistic | df | P value | Alternative hypothesis | mean of x | mean of y |
|---|---|---|---|---|---|
| -0.548 | 165.9 | 0.5844 | two.sided | 8.521 | 9.048 |
pander(t.test(HSS_C_Sibs$Hanging_Out_With_Friends_Hours, HSS_C_ONly$Hanging_Out_With_Friends_Hours, paired = FALSE, mu = 0, alternative = "two.sided", conf.level = 0.9))
| Test statistic | df | P value | Alternative hypothesis | mean of x | mean of y |
|---|---|---|---|---|---|
| 0.5354 | 161.4 | 0.5931 | two.sided | 9.048 | 8.518 |
## [1] 70 28
## [1] 51 8
## [1] 58 19
In both test sections, all Q-Q plots contain data points outside the accepted range, but given that each group's n was greater than 80, it is assumed that the conditions for the t-tests are met and that the sample values represent the true population values.
For time spent with family, no test showed a significant difference; the most extreme result was large families vs sibling pairs, with a p-value of 0.159.
For time spent with friends, no test showed a significant difference either; the most extreme result was again large families vs sibling pairs, with a p-value of 0.5844.
Given that no results show significant differences, we fail to reject the null hypothesis in all six tests. Small variations in patterns were observed within the sample, but the smallest p-value across all tests was 0.159, above the chosen level of α=0.10. The conclusion from these results is that, regardless of family size, high school students spend similar amounts of time with friends and with family per week.
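The decision rule behind this conclusion can be checked directly from the six p-values reported in the tables above. This is only a small sketch. The vector names are labels made up here, while the numbers are copied from the pander output.

```r
alpha <- 0.10

# p-values copied from the six t-test tables above; names are illustrative.
p_values <- c(
  family_sibpair_vs_only   = 0.7845,
  family_sibpair_vs_large  = 0.1592,
  family_large_vs_only     = 0.2759,
  friends_sibpair_vs_only  = 0.9974,
  friends_sibpair_vs_large = 0.5844,
  friends_large_vs_only    = 0.5931
)

# Reject a null hypothesis only when its p-value is below alpha.
reject <- p_values < alpha
any(reject)                       # is any test significant?
names(p_values)[which.min(p_values)]  # which test came closest?
```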
The group means (hours per week) for these tests are shown here:
| Test / Group | Large Family | Only Child | Sibling Pair |
|---|---|---|---|
| Hours Spent With Friends | 9.05 | 8.52 | 8.52 |
| Hours Spent With Family | 5.45 | 6.35 | 6.58 |
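A table of group means like the one above can be produced in one step with base R's `aggregate()`. The sketch below uses a small made-up data frame with hypothetical column names (`friends_hours`, `family_hours`), so its numbers are not the HSS results.

```r
# Made-up data for illustration only; not the HSS survey values.
toy <- data.frame(
  FamGroup = c("Only Child", "Only Child", "Sibling Pair",
               "Sibling Pair", "Large Family", "Large Family"),
  friends_hours = c(8, 9, 7, 10, 9, 11),
  family_hours  = c(6, 7, 5, 8, 4, 6)
)

# Mean of each response column within each family-size group.
group_means <- aggregate(
  cbind(friends_hours, family_hours) ~ FamGroup,
  data = toy,
  FUN = mean
)
group_means
```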